triangular inequality
Appendix for " Fine-Grained Theoretical Analysis of Federated Zeroth-Order Optimization "
The main notations of this paper are summarized in Table 1. Table 1: Descriptions of the main notations used in this work.Notations Descriptions N, n the total number of clients and the total sample number of each client S, S We first introduce the lemmas which will be used in our proofs. Let e be the base of the natural logarithm. The stated result in Part (b) is proved. The optimization bound is given.
6fee03d84375a159ecd3769ebbacae83-Supplemental-Conference.pdf
Convergence of stochastic gradient descent for non-smooth problems is a known result. For completeness, wereproduce and adapt ausual proof toour setting. Let us denote byF the class of functions fromX toY we are going to work with. Assumption 1 states that we have a well-specified modelF to estimate the median,i.e. Let us begin by controlling the estimation error.
Appendix for " Fine-Grained Theoretical Analysis of Federated Zeroth-Order Optimization "
The main notations of this paper are summarized in Table 1. Table 1: Descriptions of the main notations used in this work.Notations Descriptions N, n the total number of clients and the total sample number of each client S, S We first introduce the lemmas which will be used in our proofs. Let e be the base of the natural logarithm. The stated result in Part (b) is proved. The optimization bound is given.
Design of the topology for contrastive visual-textual alignment
Cosine similarity is the common choice for measuring the distance between the feature representations in contrastive visual-textual alignment learning. However, empirically a learnable softmax temperature parameter is required when learning on large-scale noisy training data. In this work, we first discuss the role of softmax temperature from the embedding space's topological properties. We argue that the softmax temperature is the key mechanism for contrastive learning on noisy training data. It acts as a scaling factor of the distance range (e.g. [-1, 1] for the cosine similarity), and its learned value indicates the level of noise in the training data. Then, we propose an alternative design of the topology for the embedding alignment. We make use of multiple class tokens in the transformer architecture; then map the feature representations onto an oblique manifold endowed with the negative inner product as the distance function. With this configuration, we largely improve the zero-shot classification performance of baseline CLIP models pre-trained on large-scale datasets by an average of 6.1\%.
An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices with Polynomial Scalings
We investigate random matrices whose entries are obtained by applying a nonlinear kernel function to pairwise inner products between $n$ independent data vectors, drawn uniformly from the unit sphere in $\mathbb{R}^d$. This study is motivated by applications in machine learning and statistics, where these kernel random matrices and their spectral properties play significant roles. We establish the weak limit of the empirical spectral distribution of these matrices in a polynomial scaling regime, where $d, n \to \infty$ such that $n / d^\ell \to \kappa$, for some fixed $\ell \in \mathbb{N}$ and $\kappa \in (0, \infty)$. Our findings generalize an earlier result by Cheng and Singer, who examined the same model in the linear scaling regime (with $\ell = 1$). Our work reveals an equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as a linear combination of a (shifted) Wishart matrix and an independent matrix sampled from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by $\ell$ and the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law. We also extend our results to cases with data vectors sampled from isotropic Gaussian distributions instead of spherical distributions.
Localization in 1D non-parametric latent space models from pairwise affinities
Giraud, Christophe, Issartel, Yann, Verzelen, Nicolas
We consider the problem of estimating latent positions in a one-dimensional torus from pairwise affinities. The observed affinity between a pair of items is modeled as a noisy observation of a function $f(x^*_{i},x^*_{j})$ of the latent positions $x^*_{i},x^*_{j}$ of the two items on the torus. The affinity function $f$ is unknown, and it is only assumed to fulfill some shape constraints ensuring that $f(x,y)$ is large when the distance between $x$ and $y$ is small, and vice-versa. This non-parametric modeling offers a good flexibility to fit data. We introduce an estimation procedure that provably localizes all the latent positions with a maximum error of the order of $\sqrt{\log(n)/n}$, with high-probability. This rate is proven to be minimax optimal. A computationally efficient variant of the procedure is also analyzed under some more restrictive assumptions. Our general results can be instantiated to the problem of statistical seriation, leading to new bounds for the maximum error in the ordering.
Adversarial robust weighted Huber regression
Sasai, Takeyuki, Fujisawa, Hironori
We propose a novel method to estimate the coefficients of linear regression when outputs and inputs are contaminated by malicious outliers. Our method consists of two-step: (i) Make appropriate weights $\left\{\hat{w}_i\right\}_{i=1}^n$ such that the weighted sample mean of regression covariates robustly estimates the population mean of the regression covariate, (ii) Process Huber regression using $\left\{\hat{w}_i\right\}_{i=1}^n$. When (a) the regression covariate is a sequence with i.i.d. random vectors drawn from sub-Gaussian distribution with unknown mean and known identity covariance and (b) the absolute moment of the random noise is finite, our method attains a faster convergence rate than Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020). Furthermore, our result is minimax optimal up to constant factor. When (a) the regression covariate is a sequence with i.i.d. random vectors drawn from heavy tailed distribution with unknown mean and bounded kurtosis and (b) the absolute moment of the random noise is finite, our method attains a convergence rate, which is minimax optimal up to constant factor.
Towards an information-theory for hierarchical partitions
Perotti, Juan I., Almeira, Nahuel, Saracco, Fabio
Complex systems often require descriptions covering a wide range of scales and organization levels, where a hierarchical decomposition of their description into components and sub-components is often convenient. To better understand the hierarchical decomposition of complex systems, in this work we prove a few essential results that contribute to the development of an information-theory for hierarchical-partitions.
Tree Edit Distance Learning via Adaptive Symbol Embeddings: Supplementary Materials and Results
Metric learning has the aim to improve classification accuracy by learning a distance measure which brings data points from the same class closer together and pushes data points from different classes further apart. Recent research has demonstrated that metric learning approaches can also be applied to trees, such as molecular structures, abstract syntax trees of computer programs, or syntax trees of natural language, by learning the cost function of an edit distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree. However, learning such costs directly may yield an edit distance which violates metric axioms, is challenging to interpret, and may not generalize well. In this contribution, we propose a novel metric learning approach for trees which learns an edit distance indirectly by embedding the tree nodes as vectors, such that the Euclidean distance between those vectors supports class discrimination. We learn such embeddings by reducing the distance to prototypical trees from the same class and increasing the distance to prototypical trees from different classes. In our experiments, we show that our proposed metric learning approach improves upon the state-of-the-art in metric learning for trees on six benchmark data sets, ranging from computer science over biomedical data to a natural-language processing data set containing over 300,000 nodes.
Finding Frequent Entities in Continuous Data
Alet, Ferran, Chitnis, Rohan, Kaelbling, Leslie P., Lozano-Perez, Tomas
In many applications that involve processing high-dimensional data, it is important to identify a small set of entities that account for a significant fraction of detections. Rather than formalize this as a clustering problem, in which all detections must be grouped into hard or soft categories, we formalize it as an instance of the frequent items or heavy hitters problem, which finds groups of tightly clustered objects that have a high density in the feature space. We show that the heavy hitters formulation generates solutions that are more accurate and effective than the clustering formulation. In addition, we present a novel online algorithm for heavy hitters, called HAC, which addresses problems in continuous space, and demonstrate its effectiveness on real video and household domains.